Intro from Skeleton Code (with some edits to add more features to subset):
setwd('/Users/robin/Desktop/Assignment 1 MSDS 410')
mydata <- read.csv(file="ames_housing_data.csv", header=TRUE, sep=",")
str(mydata)
## 'data.frame': 2930 obs. of 82 variables:
## $ SID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PID : int 526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
## $ SubClass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ Zoning : chr "RL" "RH" "RL" "RL" ...
## $ LotFrontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ LotArea : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "IR1" "Reg" "IR1" "Reg" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Corner" "Inside" "Corner" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "NAmes" "NAmes" "NAmes" "NAmes" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "1Story" "1Story" "1Story" "1Story" ...
## $ OverallQual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ OverallCond : int 5 6 6 5 5 6 5 5 5 5 ...
## $ YearBuilt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ YearRemodel : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
## $ RoofStyle : chr "Hip" "Gable" "Hip" "Hip" ...
## $ RoofMat : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1 : chr "BrkFace" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ Exterior2 : chr "Plywood" "VinylSd" "Wd Sdng" "BrkFace" ...
## $ MasVnrType : chr "Stone" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 112 0 108 0 0 20 0 0 0 0 ...
## $ ExterQual : chr "TA" "TA" "TA" "Gd" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ BsmtQual : chr "TA" "TA" "TA" "TA" ...
## $ BsmtCond : chr "Gd" "TA" "TA" "TA" ...
## $ BsmtExposure : chr "Gd" "No" "No" "No" ...
## $ BsmtFinType1 : chr "BLQ" "Rec" "ALQ" "ALQ" ...
## $ BsmtFinSF1 : int 639 468 923 1065 791 602 616 263 1180 0 ...
## $ BsmtFinType2 : chr "Unf" "LwQ" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 144 0 0 0 0 0 0 0 0 ...
## $ BsmtUnfSF : int 441 270 406 1045 137 324 722 1017 415 994 ...
## $ TotalBsmtSF : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Fa" "TA" "TA" "Ex" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ FirstFlrSF : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ SecondFlrSF : int 0 0 0 0 701 678 0 0 0 776 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ BsmtFullBath : int 1 0 0 1 0 0 1 0 1 0 ...
## $ BsmtHalfBath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 1 1 1 2 2 2 2 2 2 2 ...
## $ HalfBath : int 0 0 1 1 1 1 0 0 0 1 ...
## $ BedroomAbvGr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ KitchenQual : chr "TA" "TA" "Gd" "Ex" ...
## $ TotRmsAbvGrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...
## $ FireplaceQu : chr "Gd" NA NA "TA" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Attchd" ...
## $ GarageYrBlt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ GarageFinish : chr "Fin" "Unf" "Unf" "Fin" ...
## $ GarageCars : int 2 1 1 2 2 2 2 2 2 2 ...
## $ GarageArea : int 528 730 312 522 482 470 582 506 608 442 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "P" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ OpenPorchSF : int 62 0 36 0 34 36 0 82 152 60 ...
## $ EnclosedPorch: int 0 0 0 0 0 0 170 0 0 0 ...
## $ ThreeSsnPorch: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ScreenPorch : int 0 120 0 0 0 0 0 144 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA "MnPrv" NA NA ...
## $ MiscFeature : chr NA NA "Gar2" NA ...
## $ MiscVal : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ MoSold : int 5 6 6 4 3 6 4 1 3 6 ...
## $ YrSold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ SaleType : chr "WD " "WD " "WD " "WD " ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Normal" ...
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
head(mydata)
## SID PID SubClass Zoning LotFrontage LotArea Street Alley LotShape
## 1 1 526301100 20 RL 141 31770 Pave <NA> IR1
## 2 2 526350040 20 RH 80 11622 Pave <NA> Reg
## 3 3 526351010 20 RL 81 14267 Pave <NA> IR1
## 4 4 526353030 20 RL 93 11160 Pave <NA> Reg
## 5 5 527105010 60 RL 74 13830 Pave <NA> IR1
## 6 6 527105030 60 RL 78 9978 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Corner Gtl NAmes Norm Norm
## 2 Lvl AllPub Inside Gtl NAmes Feedr Norm
## 3 Lvl AllPub Corner Gtl NAmes Norm Norm
## 4 Lvl AllPub Corner Gtl NAmes Norm Norm
## 5 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 6 Lvl AllPub Inside Gtl Gilbert Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodel RoofStyle
## 1 1Fam 1Story 6 5 1960 1960 Hip
## 2 1Fam 1Story 5 6 1961 1961 Gable
## 3 1Fam 1Story 6 6 1958 1958 Hip
## 4 1Fam 1Story 7 5 1968 1968 Hip
## 5 1Fam 2Story 5 5 1997 1998 Gable
## 6 1Fam 2Story 6 6 1998 1998 Gable
## RoofMat Exterior1 Exterior2 MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg BrkFace Plywood Stone 112 TA TA
## 2 CompShg VinylSd VinylSd None 0 TA TA
## 3 CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA
## 4 CompShg BrkFace BrkFace None 0 Gd TA
## 5 CompShg VinylSd VinylSd None 0 TA TA
## 6 CompShg VinylSd VinylSd BrkFace 20 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 CBlock TA Gd Gd BLQ 639
## 2 CBlock TA TA No Rec 468
## 3 CBlock TA TA No ALQ 923
## 4 CBlock TA TA No ALQ 1065
## 5 PConc Gd TA No GLQ 791
## 6 PConc TA TA No GLQ 602
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 Unf 0 441 1080 GasA Fa Y
## 2 LwQ 144 270 882 GasA TA Y
## 3 Unf 0 406 1329 GasA TA Y
## 4 Unf 0 1045 2110 GasA Ex Y
## 5 Unf 0 137 928 GasA Gd Y
## 6 Unf 0 324 926 GasA Ex Y
## Electrical FirstFlrSF SecondFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 1656 0 0 1656 1
## 2 SBrkr 896 0 0 896 0
## 3 SBrkr 1329 0 0 1329 0
## 4 SBrkr 2110 0 0 2110 1
## 5 SBrkr 928 701 0 1629 0
## 6 SBrkr 926 678 0 1604 0
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 0 3 1 TA
## 2 0 1 0 2 1 TA
## 3 0 1 1 3 1 Gd
## 4 0 2 1 3 1 Ex
## 5 0 2 1 3 1 TA
## 6 0 2 1 3 1 Gd
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 7 Typ 2 Gd Attchd 1960
## 2 5 Typ 0 <NA> Attchd 1961
## 3 6 Typ 0 <NA> Attchd 1958
## 4 8 Typ 2 TA Attchd 1968
## 5 6 Typ 1 TA Attchd 1997
## 6 7 Typ 1 Gd Attchd 1998
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Fin 2 528 TA TA P
## 2 Unf 1 730 TA TA Y
## 3 Unf 1 312 TA TA Y
## 4 Fin 2 522 TA TA Y
## 5 Fin 2 482 TA TA Y
## 6 Fin 2 470 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch ThreeSsnPorch ScreenPorch PoolArea
## 1 210 62 0 0 0 0
## 2 140 0 0 0 120 0
## 3 393 36 0 0 0 0
## 4 0 0 0 0 0 0
## 5 212 34 0 0 0 0
## 6 360 36 0 0 0 0
## PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 <NA> <NA> <NA> 0 5 2010 WD Normal
## 2 <NA> MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> <NA> 0 4 2010 WD Normal
## 5 <NA> MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
names(mydata)
## [1] "SID" "PID" "SubClass" "Zoning"
## [5] "LotFrontage" "LotArea" "Street" "Alley"
## [9] "LotShape" "LandContour" "Utilities" "LotConfig"
## [13] "LandSlope" "Neighborhood" "Condition1" "Condition2"
## [17] "BldgType" "HouseStyle" "OverallQual" "OverallCond"
## [21] "YearBuilt" "YearRemodel" "RoofStyle" "RoofMat"
## [25] "Exterior1" "Exterior2" "MasVnrType" "MasVnrArea"
## [29] "ExterQual" "ExterCond" "Foundation" "BsmtQual"
## [33] "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinSF1"
## [37] "BsmtFinType2" "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF"
## [41] "Heating" "HeatingQC" "CentralAir" "Electrical"
## [45] "FirstFlrSF" "SecondFlrSF" "LowQualFinSF" "GrLivArea"
## [49] "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath"
## [53] "BedroomAbvGr" "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd"
## [57] "Functional" "Fireplaces" "FireplaceQu" "GarageType"
## [61] "GarageYrBlt" "GarageFinish" "GarageCars" "GarageArea"
## [65] "GarageQual" "GarageCond" "PavedDrive" "WoodDeckSF"
## [69] "OpenPorchSF" "EnclosedPorch" "ThreeSsnPorch" "ScreenPorch"
## [73] "PoolArea" "PoolQC" "Fence" "MiscFeature"
## [77] "MiscVal" "MoSold" "YrSold" "SaleType"
## [81] "SaleCondition" "SalePrice"
mydata$TotalFloorSF <- mydata$FirstFlrSF + mydata$SecondFlrSF
mydata$HouseAge <- mydata$YrSold - mydata$YearBuilt
mydata$QualityIndex <- mydata$OverallQual * mydata$OverallCond
mydata$logSalePrice <- log(mydata$SalePrice)
mydata$price_sqft <- mydata$SalePrice/mydata$TotalFloorSF
summary(mydata$price_sqft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.37 100.57 120.43 121.60 140.01 276.25
hist(mydata$price_sqft)
# Note: HouseStyle, LotShape and LotArea are each listed twice, so subset()
# returns duplicate columns with a ".1" suffix (HouseStyle.1, LotShape.1, LotArea.1).
subdat <- subset(mydata, select=c("TotalFloorSF","HouseAge","QualityIndex",
"price_sqft", "SalePrice","LotArea",
"BsmtFinSF1","Neighborhood","HouseStyle",
"LotShape","OverallQual","logSalePrice",
"TotalBsmtSF","HouseStyle","Zoning","LotShape",
"SaleCondition","Functional","LotArea","SubClass",
"LotFrontage","OverallCond","YearBuilt","ExterQual",
"ExterCond","FirstFlrSF","SecondFlrSF","BedroomAbvGr",
"TotRmsAbvGrd","GrLivArea","MiscVal","YearRemodel"))
str(subdat)
## 'data.frame': 2930 obs. of 32 variables:
## $ TotalFloorSF : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ HouseAge : int 50 49 52 42 13 12 9 18 15 11 ...
## $ QualityIndex : int 30 30 36 35 25 36 40 40 40 35 ...
## $ price_sqft : num 130 117 129 116 117 ...
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
## $ LotArea : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ BsmtFinSF1 : int 639 468 923 1065 791 602 616 263 1180 0 ...
## $ Neighborhood : chr "NAmes" "NAmes" "NAmes" "NAmes" ...
## $ HouseStyle : chr "1Story" "1Story" "1Story" "1Story" ...
## $ LotShape : chr "IR1" "Reg" "IR1" "Reg" ...
## $ OverallQual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ logSalePrice : num 12.3 11.6 12.1 12.4 12.2 ...
## $ TotalBsmtSF : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ HouseStyle.1 : chr "1Story" "1Story" "1Story" "1Story" ...
## $ Zoning : chr "RL" "RH" "RL" "RL" ...
## $ LotShape.1 : chr "IR1" "Reg" "IR1" "Reg" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Normal" ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ LotArea.1 : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ SubClass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ LotFrontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ OverallCond : int 5 6 6 5 5 6 5 5 5 5 ...
## $ YearBuilt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ ExterQual : chr "TA" "TA" "TA" "Gd" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ FirstFlrSF : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ SecondFlrSF : int 0 0 0 0 701 678 0 0 0 776 ...
## $ BedroomAbvGr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ TotRmsAbvGrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ GrLivArea : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ MiscVal : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ YearRemodel : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
subdatnum <- subset(mydata, select=c("TotalFloorSF","HouseAge","QualityIndex",
"SalePrice","LotArea","OverallQual","logSalePrice"))
Section 1: Sample Definition
• Remove houses with total floor area above 4,000 square feet, because the data beyond that point are too sparse to support a reliable analysis.
• Remove houses with a quality index below 5 or an overall quality below 2. After analyzing the range of values both features can take (with a quality index high of 35 and an overall quality high of 10), these houses look irregular. They appear to be extremely poor in quality and can therefore add a lot of variability to the analysis. We should focus on houses with more moderate levels of quality and condition.
• Dwellings with MS Zoning classified as Commercial or Industrial should also be eliminated from the analysis and regression, so that they are not confused with houses in residential areas. Commercial and industrial properties differ significantly from residential properties and follow a different pricing structure, which may be a point of error in the analysis.
• Lot shape IR3 has only a few examples in this dataset, which may not be enough for a proper analysis, so these houses should be dropped.
• Sales with an Abnormal, Alloca, or Partial sale condition should not be considered, because sale prices may vary under these circumstances and cause errors during analysis.
• Houses without typical Functionality should be removed.
• A lot area above 100,000 is far higher than the mean lot area, and anything above that value appears to be an outlier. Houses with lot sizes above 100,000 should therefore be removed, primarily because of the scarcity of data above that lot size.
• We should also remove entries with N/A in the values being analyzed, so that the analysis uses only the values it actually has and avoids errors down the line.
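The drop conditions above can be sketched as a sequence of filters, each recording how many rows survive. The waterfall code itself is not reproduced in this report, so the step order, the helper name `apply_waterfall`, the `steps` list, and the toy rows below are all assumptions made for illustration. The category codes (`C (all)`, `I (all)`, `Abnorml`, `Alloca`, `Partial`, `Typ`, `IR3`) are the codes I would expect in the Ames data and should be checked against the actual file.

```r
# Toy rows standing in for subdat; the values are made up for illustration.
toy <- data.frame(
  TotalFloorSF  = c(1656, 4500, 1329, 2110, 1629, 1604, 1338),
  QualityIndex  = c(30, 30, 4, 35, 25, 36, 40),
  OverallQual   = c(6, 5, 1, 7, 5, 6, 8),
  Zoning        = c("RL", "RL", "RL", "C (all)", "RL", "RL", "RL"),
  LotShape      = c("IR1", "Reg", "Reg", "Reg", "IR3", "Reg", "Reg"),
  SaleCondition = c("Normal", "Normal", "Normal", "Normal", "Normal",
                    "Partial", "Normal"),
  Functional    = rep("Typ", 7),
  LotArea       = c(31770, 11622, 14267, 11160, 13830, 9978, 4920),
  stringsAsFactors = FALSE
)

# Each step is a named rule that returns TRUE for the rows to KEEP.
steps <- list(
  "TotalFloorSF <= 4000" = function(d) d$TotalFloorSF <= 4000,
  "QualityIndex >= 5"    = function(d) d$QualityIndex >= 5,
  "OverallQual >= 2"     = function(d) d$OverallQual >= 2,
  "Residential zoning"   = function(d) !(d$Zoning %in% c("C (all)", "I (all)")),
  "LotShape not IR3"     = function(d) d$LotShape != "IR3",
  "Regular sale"         = function(d) !(d$SaleCondition %in%
                                           c("Abnorml", "Alloca", "Partial")),
  "Typical function"     = function(d) d$Functional == "Typ",
  "LotArea <= 100000"    = function(d) d$LotArea <= 100000
)

# Apply the rules in order and record the remaining row count after each one.
apply_waterfall <- function(d, steps) {
  counts <- c(Start = nrow(d))
  for (nm in names(steps)) {
    keep <- steps[[nm]](d)
    d <- d[!is.na(keep) & keep, , drop = FALSE]  # rows with NA in the rule are dropped
    counts[nm] <- nrow(d)
  }
  list(data = d, counts = counts)
}

result <- apply_waterfall(toy, steps)
subdat7_toy <- na.omit(result$data)  # final step: drop rows with any remaining NA
print(result$counts)
```

Keeping the per-step counts makes it easy to report how many observations each condition removes, which is the point of a waterfall table.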
Waterfall included.
Section 2: Data Quality Check
• 20 analyzed features:
Total Floor SF
House Style
Lotshape
House Age
MS SubClass
Lot Area
Lot Frontage
Overall Qual
Overall Cond
Year built
Exterior Quality
Exterior Condition
Misc Val
1st Flr SF
2nd Flr SF
BedroomAbvGr
TotRmsAbvGrd
Foundation
GrLivArea
YearRemodel
• After viewing a histogram of TotalFloorSF, I can see that data are scarce above about 4000 square feet, so houses above that value should be removed. In my opinion this feature also has a significant impact on sale price, because house price tends to increase with house size.
• House style is correlated with house size, so it may also affect the price significantly.
• Lotshape is also related to overall property size. The larger the lot, the higher the price should be, holding all else constant. This feature needed to be edited because of the lack of data for ‘IR3’; it is better to eliminate houses with that entry.
• House age could determine house quality, as newer houses may have more features, may last longer, and may need fewer repairs. The better the condition of the house, the higher its price should be.
• Subclass is also related to house style and house size, as 2 story houses are bigger and have more living area than 1 story houses, and hence could be more expensive holding other variables constant. This feature holds a lot of weight, so it should be analyzed in relation to the other features.
• Lot area is another feature that contributes to the size of the property.
• Lot frontage also contributes to size via lot area, and seems to be a heavily weighted feature. It can also indicate whether the property is in a more isolated area or is connected to the street. Properties in these two kinds of area could vary in price significantly.
• Overall Condition and Overall Quality go hand in hand and state how good the property is. The higher the value, the higher the price typically is, making both highly weighted features.
• Year built is related to age, which may indicate quality. Some individuals may also prefer older homes because of the materials used, which could contribute heavily to house price.
• Exterior quality and condition also indicate how much repair the house needs. If the house needs more repairs and both values are low, its price may be lower as well, making these heavily weighted features for analysis.
• Miscellaneous Feature Value also adds significantly to property value. Houses with a pool, shed, basketball court, etc. may be more desirable and could hence sell for more.
• 1st floor square feet and 2nd floor square feet contribute to the size of the property. They could also distinguish the house subclass, making them highly weighted features.
• Bedrooms and total rooms above ground also help determine the size of the property and its living space. They could also indicate whether larger or multiple families could live in the house. Distinguishing between the two could explain variation in price, making them worth including in the analysis.
• Foundation could indicate the material the house was built on and could help determine the longevity of the property. It needs to be converted into numerical values, however.
• GR Living Area contributes to size and to how much space a family has to live in, and so contributes heavily to price.
• The year the property was remodeled could significantly affect the style, features, and durability/quality/condition of the house, making it a heavily weighted feature.
All of these features are what I thought would contribute the most to determining the sale price of the house and are heavily weighted features. Some needed to be edited because of lack of data surrounding the various entries, which may cause greater errors in predictions.
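As a minimal sketch of how a categorical feature such as Foundation could be converted into numerical values, the example below uses the six Foundation entries printed by `head(mydata)` earlier. The variable names (`foundation`, `codes`, `dummies`) are my own, and treating Foundation as an unordered factor is an assumption; integer codes impose an arbitrary alphabetical order, so indicator columns are usually the safer input for regression.

```r
# Foundation values of the first six houses, copied from the head(mydata) output.
foundation <- c("CBlock", "CBlock", "CBlock", "CBlock", "PConc", "PConc")

# A factor stores each category as an integer code with a text label.
foundation_f <- factor(foundation)

# Option 1: integer codes (levels are sorted alphabetically: CBlock = 1, PConc = 2).
codes <- as.integer(foundation_f)

# Option 2: one 0/1 indicator column per category.
dummies <- model.matrix(~ foundation_f - 1)
colnames(dummies) <- levels(foundation_f)

# table() gives category counts, which summary() does not do for character columns.
print(table(foundation_f))
print(dummies)
```

The same `table()` call is a quick way to inspect other character columns such as ExterQual and ExterCond, for which `summary()` only reports length, class, and mode.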
#####################################################################
############# Data Quality Check ###########################
# mydata[!complete.cases(mydata),]
##################################################################
print('TotalFloorSF')
## [1] "TotalFloorSF"
hist(subdat7$TotalFloorSF)
summary(subdat7$TotalFloorSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 492 1092 1418 1464 1717 3820
quantile(subdat7$TotalFloorSF)
## 0% 25% 50% 75% 100%
## 492.0 1092.0 1418.5 1717.0 3820.0
print('Subclass')
## [1] "Subclass"
hist(subdat7$SubClass)
summary(subdat7$SubClass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 20.00 50.00 58.14 70.00 190.00
quantile(subdat7$SubClass)
## 0% 25% 50% 75% 100%
## 20 20 50 70 190
print('LotShape Plot')
## [1] "LotShape Plot"
require(ggplot2)
## Loading required package: ggplot2
ggplot(subdat7) +
geom_bar( aes(LotShape) ) +
ggtitle("Number of houses per Lotshape") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
print('House Age')
## [1] "House Age"
hist(subdat7$HouseAge)
summary(subdat7$HouseAge)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 34.50 36.95 53.00 136.00
quantile(subdat7$HouseAge)
## 0% 25% 50% 75% 100%
## 0.0 9.0 34.5 53.0 136.0
print('Lot Area')
## [1] "Lot Area"
plot(subdat7$LotArea)
summary(subdat7$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7352 9306 9617 11250 70761
quantile(subdat7$LotArea)
## 0% 25% 50% 75% 100%
## 1300.0 7352.5 9305.5 11250.0 70761.0
print('Lot Frontage')
## [1] "Lot Frontage"
hist(subdat7$LotFrontage)
summary(subdat7$LotFrontage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 37.00 60.00 55.23 76.00 313.00
subdat7[is.na(subdat7)] = 0
quantile(subdat7$LotArea)
## 0% 25% 50% 75% 100%
## 1300.0 7352.5 9305.5 11250.0 70761.0
print('House Style')
## [1] "House Style"
require(ggplot2)
ggplot(subdat7) +
geom_bar( aes(HouseStyle) ) +
ggtitle("Number of houses per style") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
print('Overall Qual')
## [1] "Overall Qual"
hist(as.numeric(subdat7$OverallQual))
summary(as.numeric(subdat7$OverallQual))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.089 7.000 10.000
quantile(as.numeric(subdat7$OverallQual))
## 0% 25% 50% 75% 100%
## 3 5 6 7 10
mean(as.numeric(subdat7$OverallQual))
## [1] 6.089427
print("Overall Cond")
## [1] "Overall Cond"
hist(as.numeric(subdat7$OverallCond))
summary(as.numeric(subdat7$OverallCond))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 5.000 5.664 6.000 9.000
quantile(as.numeric(subdat7$OverallCond))
## 0% 25% 50% 75% 100%
## 2 5 5 6 9
mean(subdat7$OverallCond)
## [1] 5.663877
print("YearBuilt")
## [1] "YearBuilt"
hist(subdat7$YearBuilt)
summary(subdat7$YearBuilt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1955 1973 1971 1998 2010
quantile(subdat7$YearBuilt)
## 0% 25% 50% 75% 100%
## 1872 1955 1973 1998 2010
print("ExterQual")
## [1] "ExterQual"
summary(mydata$ExterQual)
## Length Class Mode
## 2930 character character
print("ExterCond")
## [1] "ExterCond"
summary(mydata$ExterCond)
## Length Class Mode
## 2930 character character
print("FirstFlrSF")
## [1] "FirstFlrSF"
hist(subdat7$FirstFlrSF)
summary(subdat7$FirstFlrSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 442 864 1052 1125 1335 3820
quantile(subdat7$FirstFlrSF)
## 0% 25% 50% 75% 100%
## 442.00 864.00 1052.00 1334.75 3820.00
print("SecondFlrSF")
## [1] "SecondFlrSF"
hist(subdat7$SecondFlrSF)
summary(subdat7$SecondFlrSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 339.3 707.0 1836.0
quantile(subdat7$SecondFlrSF)
## 0% 25% 50% 75% 100%
## 0 0 0 707 1836
print("BedroomAbvGr")
## [1] "BedroomAbvGr"
hist(subdat7$BedroomAbvGr)
summary(subdat7$BedroomAbvGr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.864 3.000 6.000
quantile(subdat7$BedroomAbvGr)
## 0% 25% 50% 75% 100%
## 0 2 3 3 6
print("TotRmsAbvGrd")
## [1] "TotRmsAbvGrd"
hist(subdat7$TotRmsAbvGrd)
summary(subdat7$TotRmsAbvGrd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.354 7.000 13.000
quantile(subdat7$TotRmsAbvGrd)
## 0% 25% 50% 75% 100%
## 3 5 6 7 13
print("Foundation")
## [1] "Foundation"
summary(subdat7$Foundation)
## Length Class Mode
## 0 NULL NULL
print("miscVal")
## [1] "miscVal"
hist(subdat7$MiscVal)
summary(subdat7$MiscVal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 48.5 0.0 15500.0
quantile(subdat7$MiscVal)
## 0% 25% 50% 75% 100%
## 0 0 0 0 15500
print("GrLivArea")
## [1] "GrLivArea"
hist(subdat7$GrLivArea)
summary(subdat7$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 492 1095 1422 1467 1718 3820
quantile(subdat7$GrLivArea)
## 0% 25% 50% 75% 100%
## 492.00 1095.25 1422.00 1717.75 3820.00
print("YearRemodel")
## [1] "YearRemodel"
hist(subdat7$YearRemodel)
summary(subdat7$YearRemodel)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1966 1992 1984 2002 2010
quantile(subdat7$YearRemodel)
## 0% 25% 50% 75% 100%
## 1950 1966 1992 2002 2010
Section 3: Initial Exploratory Data Analysis
• Of the twenty features checked above, the ten plotted below seemed the most important, as they are mostly numeric and easy to analyze. Graphing each variable shows its distribution. Variables with roughly normal distributions can be analyzed as they are, while others may need more preprocessing before they can be included in the model. Second floor square footage is one example: it is heavily skewed to the right, probably because of the abundance of one-story homes, which have a value of 0 for 2nd floor square footage. For modeling, it should be replaced, along with 1st floor square footage, by the feature ‘GrLivArea,’ which covers all of the living area in the property.
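The suggestion to fold the two floor-area features into GrLivArea can be checked on the ten values per column printed by `str(mydata)` earlier. This is only a sketch on those ten rows; the assumption that GrLivArea equals first floor plus second floor plus low-quality finished area is mine, and it would need to be confirmed on the full data.

```r
# First ten values of each column, copied from the str(mydata) output.
first_flr  <- c(1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 1028)
second_flr <- c(0, 0, 0, 0, 701, 678, 0, 0, 0, 776)
low_qual   <- rep(0, 10)
gr_liv     <- c(1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804)

# Does GrLivArea equal the sum of its floor-level components in every row?
components_match <- all(first_flr + second_flr + low_qual == gr_liv)

# Share of houses with no second floor; a large share explains the right skew.
zero_share <- mean(second_flr == 0)

print(components_match)
print(zero_share)
```

If the identity holds across the whole dataset, keeping all three area features in one model would add redundant, collinear predictors.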
#################################################################
################## univariate EDA ##############################
###############################################################
require(ggplot2)
ggplot(subdat7) +
geom_bar( aes(LotShape) ) +
ggtitle("Number of houses per Lotshape") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=SalePrice)) +
geom_histogram(color="black", binwidth= 10000) +
labs(title="Distribution of Sale Price") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=TotalFloorSF)) +
geom_histogram(color="black", binwidth= 100) +
labs(title="Distribution of TotalFloorSF") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=QualityIndex)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of QualityIndex") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=LotArea)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of LotArea") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=FirstFlrSF)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of First Floor SF") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=SecondFlrSF)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of Second FL SF") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=GrLivArea)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of GR Liv Area") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=YearBuilt)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of Year Built") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=YearRemodel)) +
geom_histogram(color="black", binwidth= 10) +
labs(title="Distribution of YearRemodel") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
#######################################################################
########### bivariate EDA ########################################
###################################################################
ggplot(subdat7, aes(x=TotalFloorSF, y=QualityIndex)) +
geom_point(color="blue", shape=1) +
ggtitle("Scatter Plot of Total Floor SF vs QualityIndex") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=TotalFloorSF, y=HouseAge)) +
geom_point(color="blue", shape=1) +
ggtitle("Scatter Plot of Total Floor SF vs HouseAge") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=LotShape, y=HouseAge)) +
geom_boxplot(fill="blue") +
labs(title="Distribution of HouseAge by Lotshape") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
Section 4: Exploratory Data Analysis for Modeling
• I chose the variables I thought would have the most linear relationship to sale price, together with one categorical feature: TotalFloorSF, QualityIndex, and Lotshape (the categorical one). TotalFloorSF and QualityIndex both appear to have a fairly linear relationship with sale price, as sale price tends to increase with each of them. The relationship is not perfectly linear, however, which shows the importance the other features may also have for sale price. Lotshape does not seem to have much of a relationship to sale price; in the boxplot the prices are spread widely within each category. The boxplot also shows how many outliers this feature has, and that it may need more preprocessing before being included in the model. These plots give insight into which features are heavily weighted and appear necessary for a more accurate analysis, and which need to be edited, omitted, or replaced in the model.
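The visual impressions described above can be complemented with numbers: a Pearson correlation for each numeric predictor and a median sale price per Lotshape category. The sketch below uses only the first rows printed earlier in this report (ten rows from the `str` output, six LotShape values from `head`), so the results illustrate the method and say nothing reliable about the full sample. The variable names are my own.

```r
# First ten values of each column, copied from the str() output above.
sale_price     <- c(215000, 105000, 172000, 244000, 189900,
                    195500, 213500, 191500, 236500, 189000)
total_floor_sf <- c(1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804)
quality_index  <- c(30, 30, 36, 35, 25, 36, 40, 40, 40, 35)

# Pearson correlation measures the strength of a linear relationship.
cors <- c(
  TotalFloorSF = cor(total_floor_sf, sale_price),
  QualityIndex = cor(quality_index, sale_price)
)
print(round(cors, 2))

# LotShape of the first six houses, copied from the head(mydata) output.
lot_shape <- c("IR1", "Reg", "IR1", "Reg", "IR1", "IR1")

# Median sale price per category: a numeric companion to the boxplot.
med_by_shape <- tapply(sale_price[1:6], lot_shape, median)
print(med_by_shape)
```

On the real data the same calls would take `subdat7$SalePrice`, `subdat7$TotalFloorSF`, and so on in place of the hand-copied vectors.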
############################################################
################ model focussed EDA #######################
###########################################################
ggplot(subdat7, aes(x=TotalFloorSF, y=SalePrice)) +
geom_point(color="blue", size=2) +
ggtitle("Scatter Plot of Sale Price vs Total Floor SF") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5)) +
geom_smooth(method=lm, se=FALSE) ## method=lm, se=FALSE ###
## `geom_smooth()` using formula 'y ~ x'
ggplot(subdat7, aes(x=QualityIndex, y=SalePrice)) +
geom_point(color="blue", shape=1) +
ggtitle("Scatter Plot of Sale Price vs QualityIndex") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=LotShape, y=SalePrice)) +
geom_boxplot(fill="blue") +
labs(title="Distribution of Sale Price by LotShape") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=YearRemodel, y=SalePrice)) +
geom_point(color="blue", size=2) +
ggtitle("Scatter Plot of Sale Price vs YearRemodel") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5)) +
geom_smooth(method=lm, se=FALSE) ## method=lm, se=FALSE ###
## `geom_smooth()` using formula 'y ~ x'
ggplot(subdat7, aes(x=YearBuilt, y=SalePrice)) +
geom_point(color="blue", shape=1) +
ggtitle("Scatter Plot of Sale Price vs YearBuilt") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
ggplot(subdat7, aes(x=LotArea, y=SalePrice)) +
geom_point(color="blue", shape=1) +
labs(title="Scatter Plot of Sale Price vs LotArea") +
theme(plot.title=element_text(lineheight=0.8, face="bold", hjust=0.5))
#####################################################################
############# EDA for multiple variables ###########################
##################################################################
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(subdat7, cardinality_threshold=NULL)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
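The message above appears whenever a histogram is drawn without an explicit `bins` or `binwidth`. One way to avoid it is to compute a bin width from the data and pass it to `geom_histogram()`. The sketch below is only an illustration: `demo`, its simulated `SalePrice` values, and the helper `fd_binwidth()` are hypothetical stand-ins, and the Freedman-Diaconis rule is just one reasonable choice, not necessarily the one best suited to every column.

```r
# Hypothetical stand-in for one numeric column; the real values would come
# from the subset built earlier in the report.
set.seed(410)
demo <- data.frame(SalePrice = round(rlnorm(500, meanlog = 12, sdlog = 0.4)))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3), ignoring missing values
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

bw <- fd_binwidth(demo$SalePrice)

# Draw the histogram only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(demo, aes(x = SalePrice)) +
    geom_histogram(binwidth = bw, colour = "black", fill = "grey70") +
    labs(x = "SalePrice", y = "Count")
  print(p)
}
```

Because `binwidth` is set explicitly, ggplot2 no longer falls back to the default of 30 bins, so the message is not printed.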
require(lattice)
## Loading required package: lattice
pairs(subdatnum, pch = 21)
require(corrplot)
## Loading required package: corrplot
## corrplot 0.84 loaded
mcor <- cor(subdatnum)
corrplot(mcor, method = "shade", shade.col = NA, tl.col = "black", tl.cex = 0.5)
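If any column of `subdatnum` contains missing values, `cor()` with its default settings returns `NA` for every correlation involving that column, which leaves blank cells in the correlation plot. A minimal sketch of one way around this, using pairwise deletion, is shown below. The data frame `demo_num` is hypothetical: it reuses the first six values of three columns from the `str()` output, with one `LotFrontage` value replaced by `NA` purely for illustration.

```r
# Hypothetical numeric subset; one LotFrontage value is set to NA on purpose
demo_num <- data.frame(
  LotArea     = c(31770, 11622, 14267, 11160, 13830, 9978),
  LotFrontage = c(141, 80, NA, 93, 74, 78),
  OverallQual = c(6, 5, 6, 7, 5, 6)
)

# Default behaviour: any NA in a column makes its correlations NA
mcor_default <- cor(demo_num)

# Pairwise deletion: each correlation uses the rows where both columns are observed
mcor_pairwise <- cor(demo_num, use = "pairwise.complete.obs")

print(round(mcor_pairwise, 2))
```

The resulting matrix can be passed to `corrplot()` in the same way as `mcor`. Pairwise deletion means different cells may be based on different numbers of rows, so this is a convenience for exploration rather than a substitute for handling the missing values properly.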
Section 5: Summary/Conclusions
• This assignment allowed me to explore the data and clean it up in order to find the features that may be most beneficial for building a model, or at least those likely to have the greatest impact. It also let me warm up my R and RStudio skills in preparation for the assignments to come, which may involve modeling and more statistical analysis. It prepared me to read the data and judge which variables could be used in a model as they are, which need editing, which should be combined with others to better characterize a feature, and which should be omitted because they may not contribute to the model in a positive way. The visualizations produced in this assignment helped me further develop the skill set of analyzing data and exploring ways to create a more accurate model.